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Background: Recent experimental and theoretical studies have revealed that protein folding 
kinetics can be quite complex and diverse depending on various factors such as size of the protein 
sequence and external conditions. For example, some proteins fold apparently in a kinetically two 
state manner whereas others follow complex routes to the native state. We have set out to 
provide the theoretical basis for understanding the diverse behavior seen in the refolding kinetics 
of proteins in terms of properties that are intrinsic to the sequence. 

Results: The folding kinetics of a number of sequences for off-lattice continuum models of 
proteins is studied using Langevin simulations at two different values of the friction coefficient. 
We show for these models that there is a remarkable correlation between folding times, tf, and 
a = {Tq — Tf)/Tq, where Tq and Tp are the equilibrium collapse and folding transition 
temperatures, respectively. The microscopic dynamics reveals that several scenarios for the 
kinetics of refolding arise depending on the range of values of a. For relatively small a the chain 
reaches the native conformation by a direct native conformation nucleation collapse (NCNC) 
mechanism without being trapped in any detectable intermediates. For moderate and large 
values of a the kinetics is described by the kinetic partitioning mechanism (KPM) according to 
which a fraction of molecules $ (kinetic partition factor) reaches the native conformation via the 
NCNC mechanism. The remaining fraction attains the native state by off-pathway processes that 
involve trapping in several misfolded structures. The rate determining step in the off-pathway 



processes is the transition from the misfolded structures to the native state. The partition factor 
$ is also determined by a: smaller the value of a larger is <£. The qualitative aspects of our 
results are found to be independent of the friction coefficient. The simulation results and 
theoretical arguments are used to obtain estimates for time scales for folding via the NCNC 
mechanism in small proteins, those with less than about 70 amino acid residues. 

Conclusions: We have shown that the various scenarios for folding of proteins, and possibly 
other biomolecules, can be classified solely in terms of a. Proteins with small values of a reach 
the native conformation via a nucleation collapse mechanism and their energy landscape is 
characterized by having one dominant native basin of attraction (NBA). On the other hand 
proteins with large a get trapped in competing basins of attraction (CBA) in which they adopt 
misfolded structures. Only a small fraction of molecules access the native state rapidly, when a is 
large. For these sequences the majority of the molecules approach the native state by a three 
stage multipathway mechanism, in which the rate determining step involves a transition from one 
of the CBA's to the NBA. 

Key words: collapse and folding transition temperature, kinetic partitioning mechanism, 
native conformation nucleation collapse, protein folding, three-stage multipathway mechanism. 



I. INTRODUCTION 

It has become clear over the last few years that the study of minimal models has given 
rise to a novel theoretical understanding of the kinetics of protein folding [|T]-§[ . The general 
scenarios that have emerged from these studies are starting to be confirmed experimentally 
P HL4] , |33| . In particular, there is now some experimental support flllfnj for the kinetic par- 



titioning mechanism (KPM) first described using minimal off-lattice models |5|,|15|)|16| . The 
principles emerging from these studies have also been used to predict the folding pathways 
and the nature of kinetic intermediates in specific proteins. For example, it was shown that 
the single disulfide intermediate 14-38 in bovine pancreatic trypsin inhibitor, which denotes 
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that the structure of this intermediate contains a covalent disulfide bond between cysteines 
at location 14 and 38, forms early and decays before other more stable single intermediates 



start to form |T7[. This theoretical prediction has been subsequently verified experimentally 



fl"8|| . The theoretical studies pf-[^,pJ|1 have in fact provided, perhaps for the first time, a 
firm basis for understanding and predicting the overall scenarios that can arise in in vitro 
refolding kinetics of proteins. Since the general scenarios for refolding kinetics have been 
understood from a qualitative viewpoint it is of interest to correlate in a quantitative manner 
the dependence of folding times for a number of sequences in terms of parameters that can 
be experimentally measured. In a recent paper theoretical arguments were used to provide 
quantitative estimate of some of the important time scales that arise naturally according to 
the KPM [p2| . In this paper computational studies are used to complement the previous 



work. We should note that Onuchic et al. have also initiated complementary approaches to 
understand in a quantitative manner the folding kinetics of small a- helical proteins |23| . 

In our earlier work we showed using lattice models that the foldability of proteins (namely, 
the ability of a sequence with a unique native state to access it in finite time scales under 
folding conditions) can be understood in terms of two characteristic thermodynamic temper- 
atures which are intrinsic to the sequence [fM|,p5]]. One of them is Tg, the collapse transition 



temperature, at which there is a transition from a random coil to an almost compact state. 
The other is the folding transition temperature, Tp, below which the polypeptide chain is 
predominantly in the native conformation. In the biochemical literature Tp is roughly the 
melting temperature, T m . It was established for a variety of sequences (in both two and 
three dimensions) that the folding time correlates extremely well with a |p4] , p5|l , where 
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Based on very general arguments |22[ it can be shown that Tp < Tg, hence < a < 1. Both 
Tp and Tg are sensitive functions of not only the sequence but also the external conditions. 
This can be verified experimentally and data in the literature in fact support this obvious 
result. For example, Alexander et al. have shown for the IgG binding protein that Tp (in 



their notation T m ) varies linearly with pH [p6[| . These authors have also determined Tg for 
two forms of IgG binding protein. Thus, foldability of sequences and the associated kinetics 
for a given sequence can be altered by changing the external conditions. The major purpose 
of this article is to explore the folding kinetics as a function of a using off-lattice simplified 
models of polypeptide chains. In addition, we provide detailed analysis of folding kinetics for 
a number of sequences at two values of the friction coefficient to assess the role of viscosity 
on the qualitative aspects of the folding scenarios. 

The physical reason for expecting that a would control the folding rates in proteins is 
the following. If o is small, then Tg w Tp and hence all the conformations that are sampled 
at T ^ Tp have relatively high free energy. Any barrier that may exist between these high 
free energy mobile conformations can be overcome easily provided the temperature at which 
folding occurs is not too low. Thus, for small a one can in principle fold a polypeptide chain 
at a relatively high temperature (in the range where collapse and the acquisition of the native 
conformation are almost synchronous) and access the native conformation rapidly. For these 



cases the folding process would appear to be kinetically two state like |20[| . Furthermore, 
studies based on lattice models suggest that for sequences with small and moderate values 
of a the kinetic accessibility of the native conformation together with its thermodynamic 



stability can be achieved over a relatively broad temperature range pq . On the other hand 
when a ~ 1, then Tp <C Tg and in this case the folding process would inevitably be affected 
by kinetic traps and misfolded structures. Since some of these misfolded structures can have 



many elements in common with the native structure they can be fairly stable |15|Jlq] . Since 
T F is low for sequences with large a these stable structures could have long lifetimes even 
if the free energy barriers separating them and the native state are only moderate. Thus, 
it is likely that sequences with a ~ 1 are in general not foldable on biologically relevant 
time scales. These expectations are borne out in this study and a quantitative relationship 
between folding rates and a is given. The energy landscape perspective can be used to argue 
that small values of a correspond to the native state having a large native basin of attraction 
(NBA) || or funnel @,|,|27|. 



Before we close this introduction a brief comment on the use of minimal models to un- 
derstand folding kinetics is pertinent. This is especially important because their utility in 
getting insights into protein folding kinetics has been questioned ||28|| . The minimal mod- 
els do not explicitly contain all the features that are known to be important in imparting 
stability to proteins. However, there are many aspects of the minimal models that mimic 
the dominant interactions in proteins [l]. These involve chain connectivity, hydrophobicity 
as the driving force, and sequence heterogeneity. In addition, off-lattice models studied in 
this paper and elsewhere p9j|30l which use a realistic representation of the potentials for 
a-carbons of a polypeptide chain yield (0, ip) values consistent with the Ramachandran plot 
|3T|| . The aspects of real proteins that are not faithfully represented here are side chains and 
hydrogen bonds. Straub and Thirumalai have argued that lower order effects like stability 
arising from hydrogen bonds are included in the simplified off-lattice models of the sort con- 
sidered here in a coarse grained manner [[32|| . This is achieved by suitably renormalizing the 
dihedral angle potentials. Despite these important limitations the studies based on minimal 
models of proteins have been the only source of concrete testable theoretical predictions in 
the field of protein folding kinetics Insights based on the energy landscape picture of 

folding have lead, for example, to the microscopic picture of native conformation nucleation 
collapse (NCNC) mechanism in refolding of proteins [|l^,|l^,|l^] . Recently, experimentalists 
have begun interpreting their data on certain proteins ||33lj34| using the concept of NCNC 
mechanism. Thus, despite certain limitations these studies have already offered considerable 
insights into the folding kinetics of biomolecules in both in vitro and in vivo. 

Since this article is rather lengthy we provide a brief roadmap to the rest of the article. 
The details of the model and simulation methods are given in Sec. (II) and (III), respectively. 
The main results are presented in Sec. (IV) and (V). The paper is concluded in Sec. (VI). 
The appendix contains useful formulae for obtaining time scales for the dominant nucleation 
collapse process for small proteins. Readers not interested in technical details can skip Sec. 
(II) and (III) entirely. 
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II. DESCRIPTION OF THE MODEL 



The model used in our simulations is a variant of the one introduced in our previous 



studies [T^|T(|. We use continuum minimal model representation of a polypeptide chain. 
In these classes of models only the principle features of proteins responsible for imparting 
stability are retained. These include hydrophobic forces, excluded volume interactions, bond 
angle and dihedral angle degrees of freedom. The simplified model can be thought of as a 
coarse grained representation containing only the a-carbons of the protein molecule. The 
polypeptide is modeled as a chain consisting of N connected beads with each corresponding 
to a set of particular a-carbons in a real protein. In order to simplify the force field we 
assume that sequence is essentially built from residues of three types, hydrophobic (B), 
hydrophilic (L), and neutral (N). Our previous studies have established that the three 
letter code can be used to construct the basic structural motifs in proteins, namely, a-helix 
and /9-turns [ p^|3T1 , |35f . In this study we mimic the diversity in the hydrophobic residues in 
proteins using a dispersion in the interactions between B residues (see below). 

The potential energy of a conformation, which is specified by the set of vectors 
2 = 1, 2...N, is taken to be of the following form 

E p ({fi}) = Vbl + V BA + V DIH + V NON (2) 

where Vbl, Vba, Vdih, and Vnon correspond to bond- length potential, bond-angle potential, 
dihedral angle potential, and non-bonded potential, respectively. A brief summary of these 
interactions is given below. 

(a) Bond-length potential. In our previous studies we assumed that the length of 
the covalent bond connecting the successive beads to be fixed. The constraint of fixed bond 
length, which was enforced using the RATTLE algorithm H37H , proves to be computation- 
ally demanding. In the present study we use a stiff harmonic potential between successive 

residues, which keeps the bond length approximately fixed, i.e., 

N-i kr 

v bl = y(l^+i -a) 2 , (3) 



where k r = lOOe^/a 2 , a is the average bond length between two beads, and e^, the average 
strength of the hydrophobic interaction, is the unit of energy in our model. We have verified 
that using the potential in Eq. (|3|) gives the same results for the sequence that has been 
previously studied |15|| . 

(b) Bond-angle potential. The potential for the bending degrees of freedom, describ- 
ing the angle between three successive beads i, i + 1, i + 2 is taken to be 

N ' 2 ha 

v BA =Y.i^-e,)\ (4) 
i=i z 

where k e = 20e h /(rad) 2 and 9 = 1.83 26 rad or 105°. 

(c) Dihedral angle potential. This potential describes the ease of rotation around the 
angle formed between four consequent beads. This degree of freedom is largely responsible 
in determining secondary structures in a polypeptide chain ||38|| . The i th dihedral angle 4>i is 
formed between vectors Hi = (r i+lji x r i+lti+2 ) and H i+1 = (r i+2 ,i+i X r i+2> i+3), i-e., it is the 
angle between the plane defined by beads i, i + 1, i + 2 and the one spanned by beads i + 1, 
i + 2, i + 3. The vector r i>i+ i = r i+1 — fj. The general form of the potential describing the 
dihedral angle degrees of freedom is well known [|39| and can be represented as 

N-3 

Vdih = E W + cos (Pi) + Bi(l + cos 30,)] (5) 

i=l 

If two or more of the four beads in defining 0j are neutral (N) the A{ and Bi are taken 
to be Oeh and 0.2eh, respectively. For all other cases Ai = Bi = 1.26^. For the larger 
values of Ai and Bi the trans state is preferred and this leads to the formation of extended 
conformation. The presence of neutral residues, which are introduced so that loop formation 
is facilitated, has the effect of decreasing the barrier and energetic differences between the 
trans and gauche states fl35| . 

(d) Non-Bonded potential. The non-bonded potentials arise between pairs of residues 
that are not covalently bonded. These forces together with those arising from the dihedral 
angle degrees of freedom (which provide favorable local interactions for the formation of 
secondary structures) are responsible for the overall formation of the three dimensional 
topology of the polypeptide chain. 
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We take simple forms to represent the non-bonded interaction terms. We assume that 
the effective potential describing the interaction between the residues i and j (\i — j\ > 3) 
depends on the type of residues involved. The total non-bonded potential is written as 



N-3 N 



NON 



E E * 



(6) 



i=l j=i+3 

where r = \ fi — fj\. The potential between two L-beads or between a (L, B) pair is taken to 
be 



V La (r) = 4e, 



12 



+ 



[a 



L or B) 



(7) 



where ei = fe^. This potential is purely repulsive with a value of 2e h at r = 2 1//6 a, which is 
the location of the minimum in the hydrophobic potential (see Eq. (|9])). The presence of r~ 6 
term gives rise to a potential that is longer ranged than the usual r -12 term. The additional 
term may be interpreted to arise from the hydration shells around the hydrophilic residues. 
The interaction between the neutral residues and the others is expressed as 



V N Jr) = 4e„( - 



12 



(a = N, L, or B). 



(8) 



If both the residues are hydrophobic (B) the potential of interaction is taken to be 



VWr) =4Ae 7l 



12 



(9) 



where determines the strength of the hydrophobic interaction. The above form for Vbb{t) 
can be thought of as approximate representation (capturing the primary minimum) of the 



potential of mean force between spherical hydrophobic spheres in water [36 



The dimensionless parameter A is assumed to have a Gaussian distribution 

1 / (A-A ) 2 ' 



P(A) 



(2ttA 2 ) 1 /2 exp V 2A 2 

The mean value of Ao = 1. The introduction of the distribution in the strength of the 
hydrophobic interaction creates diversity among hydrophobic species, and hence provides a 
better caricature of proteins. The standard deviation A controls the degree of heterogeneity 



(10) 



s 



of the hydrophobic interactions and if its value becomes too large then the unambiguous 
division of residues into three distinct types becomes difficult. Consequently, we keep the 
value of A small enough so that the prefactor 4Xeh is in general greater than 4sl (see 
Eqs. ([7|,|9]) )• This also assumes that the interaction between hydrophobic residues remains 
attractive at the separation corresponding to Lennard- Jones minimum. For large values of 
A (not used in our present study) the distribution function in Eq. ( |10|) has to be truncated 
at some positive value of A so that the Ae^ does not become negative. 

In our earlier studies |l5| , p!6| , |35| we used A = and hence all B beads were identical. 
Thus, our previous studies correspond exactly to a three letter code. The current potential 
function with random hydrophobic interaction gives more specificity to the interactions, 
and yet preserves the overall hydrophobic interactions as the driving force for structure 
formation. 



III. SIMULATION METHODS 

A. Langevin dynamics 

Following our earlier work we have used Langevin dynamics for simulating folding kinetics 
5j. We include a damping term in the equation of motion with a properly chosen friction 
coefficient ( and the Gaussian random force to balance the energy dissipation caused by 
friction. The equation of motion written for the generalized coordinate x is given by 

mi = -(x + F c + T = F (11) 

where F c = — -^f is the conformation force, which is a negative gradient of potential energy 
with respect to the coordinate x, T is the random force having a white noise spectrum, and 
m is the mass of a bead. The equation of motion Eq. ([n]) is numerically integrated using 



the velocity form of the Verlet algorithm [40]. If the integration step is h, the position of a 



bead at the time t + h is expressed through the second order in h as 
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h 2 

x(t + h) = x(t) + hx(t) + — F(t). (12) 

2m 

Similarly, the velocity x(t + h) at the time £ + /i is given by 

K J V 2m/ V 2m V2m/ J w 2m V 2m V2m/ J 

x (F c (t) + r(t) + F c (£ + fc) + r(t + fc)) + o(^ 2 )- (13) 

Because we assume that the random force V has a white noise spectrum, the autocorrelation 
function < T(t)T(t') > is expressed in the form 

< T(t)T(t') >= 2(k B T5(t - t'). (14) 

Since the equation of motion Eq. ( ]TT| ) is discretized and solved numerically, this formula 
can be rewritten as 

< T(t)T(t + nh) >= ^j^8o,n, (15) 

where 5o in is the Kronecker delta and n = 0,1,2.... Thus, in the context of this model 
changing the temperature of the system essentially means changing the standard variance 
in the Gaussian distribution of the random force T. 

Temperature is measured in the units of eh/ks- In the underdamped limit, i.e. when £x 
is negligible compared to the inertial term in the equation of motion a natural choice 
of the unit of time is tl = (ma 2 / e^) 1 / 2 . The simulations have been done in low to moderate 
friction limit which in the rate theory of reactions would correspond to the energy diffusion 
regime. The integration step used in the equation of motion is taken to be h = 0.005t£. 
All the sequences were studied at two values of the friction coefficient ( L = O.OSmr^ 1 and 
Cm = IOO^l = hmrj^ 1 . The relation between the time unit tl and the folding time scales in 
real proteins as well as the range of ( used in this study are discussed in the Appendix. The 
Appendix also gives estimates for certain time scales in the folding kinetics of proteins using 
the simulation results and theoretical arguments. In our simulations the mass of residue m, 
the bond length a, the hydrophobic energy constant e/,, and the Boltzmann constant kg are 
set to unity. 
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B. Determination of native conformation 



For each sequence we determined the native conformation by adapting the procedure 
similar to that used in our recent work on lattice model of proteins [[24|. The database of 



sequences generated is described in Sec. (III.D). As in our earlier works [15,35] we have used 
a combination of slow cooling and simulated annealing to determine the native conformation. 
The chain is initially heated to T = 5.0 and equilibrated at this temperature for 2000tl. The 
temperature is then quenched to T = 1.0 and the chain is reequilibrated for an additional 
2000r£. This process of quenching the chain from T = 5.0 to T = 1.0 was repeated several 
times so that we generated a set of independent conformations at T = 1.0. These structures 
are used as starting conformations for the slow cooling process. In order to ascertain that the 
starting conformations are independent the overlap between a pair of these conformations 
(see Eq. ([[7])) averaged over all distinct pairs denoted by % is calculated. This yields % ~ 0.9 
that roughly corresponds to the value of x f° r a P arr of randomly generated conformations. 
The temperature of the system in the simulations starting from one of the well equilibrated 
conformations at T = 1.0, is slowly decreased to T = 0.0 (see Sec. (III.C) for details). 
In the process of reaching T = 0.0 the energies of the conformations are recorded. This 
process is repeated for several (typically 10) initial conformations. The conformation with 
the lowest energy is assumed to be the native state for the sequence. After determining the 
native conformation by this method we raised the temperature from 0.0 to 0.2 in IOOOtl 
and then lowered it to T = 0.0, i.e. we performed a simple simulated annealing procedure. 
In all instances the resulting structure and the energy coincided with those obtained by the 
slow cooling protocol. It should be emphasized that this method cannot guarantee that the 
structures are indeed global energy minima. However, the determination of native structures 



for these sequences by other optimization techniques leads to the same structures (41| . Thus, 
we are fairly certain that the structures found by this method indeed are the lowest energy 
structures for our model. 
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C. Thermodynamic properties 



We and others have shown that each foldable sequence is characterized by two natural 
temperatures |l|J^,^ , 15 , 23] . One of them is Tg below which the chain adopts more or less 
compact conformation. The transition at Tg is (usually) second order in character suitably 
modified by a finite size effects. Following our earlier studies Tg is located by determining 
the temperature dependence of the heat capacity [T^J2^f2^] 

C. = < E2 >-< E >\ (16) 

The location of the peak in C v is taken to be Tg. Previous studies have shown that at T « Tg 
the radius of gyration changes dramatically reaching a value roughly coinciding with that 
for a compact conformation fll5"l,|35] . This, of course, is usually taken to be a signature of 
"collapse" transition in homopolymers fl42 |. 



The second crucial temperature is the folding transition temperature Tp. There are 
several ways of calculating Tp all of which seem to give roughly similar estimates [f7,|2~5,47 



We use the fluctuations in the structural overlap function to estimate Tp. The structural 
overlap function is defined as 

9 N-3 N 

where is the distance between the beads % and j for a given conformation, rfj is the 
corresponding distance in the native conformation, and 0(x) is the Heavyside function. If 
\rij- < e then the beads i and j are assumed to form a native contact. In our simulations 
we take e = 0.2a. 

It follows from the definition of x t na t at finite temperatures < x{T) >, the thermal 
average, is in general non-zero. The folding transition temperature is obtained from the 
temperature dependence of the fluctuations in x 

A X =< X 2 (T) >-< X (T) > 2 (18) 
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For sequences with a unique ground state A\ exhibits a peak at T ~ Tp p5| . It has been 
shown that for these simple off-lattice models this transition is a finite size first order phase 
transition ||15|| . Our previous lattice model studies have shown that Tp obtained from the 
temperature dependence of Ax is in general slightly smaller than that calculated from the 
midpoint of < x(T) > or other suitable order parameters [J3|. 



The thermodynamic properties like < xCO >> total energy < E(T) > (the sum of 
kinetic and potential energies) etc. are calculated using time averages over sufficiently long 
trajectories. The trajectories which are generated in the search for the native structure can 
be used to get an approximate estimate of the temperature interval (T^T/J, which includes 
the temperatures Tg and Tp. In all cases we set = 1.0, while T} varies from 0.3 to 
0.4. Each trajectory starts with the same zigzag initial conformation. The chain is then 
heated at T = 5.0 and brought to equilibrium at T = 1.0 as described in Sec. (III.B). 
The method of slow cooling employed for calculating thermodynamic averages is identical 
to that presented in |[43| . The system is periodically cooled (starting at the temperature 
T/j) by an amount AT. The time T max is the time of running the simulations at the fixed 
temperature, Tj = T^ — iAT, where i = 0,1,2.... In this study we have set AT = 0.02, 
the time T max = 2500r L , and the equilibration time after the change of the temperature 
by AT to be r eq = 250tl. These values are used with most sequences within the entire 
temperature interval (T}, T^). The thermodynamic values for one particular initial condition 
i is calculated as 

1 tTeq+Tav 

fi(T) = — fi(T,t)dt, (19) 

where r av = T max — r eq . The equilibrium thermodynamic value is obtained by averaging over 
a number of initial conditions 

1 M - 

<f(T)>=^J2MT). (20) 
m i=i 

We found that M = 50 was sufficient to obtain accurate results for equilibrium properties. 
For most sequences the values of parameters (see above) used in the course of equilibration 
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were large enough to obtain converged results. In some instances the equilibration times had 
to be increased to obtain converged results. The functions < E(T) >, C V (T), and A\(T) 
are obtained by fitting the data by polynomials, and < x{T) > was fit with two hyperbolic 
tangents. 

We should note that due to the intrinsic heterogeneity of these systems non-ergodicity 



effects often manifest themselves [|15|| . If this is the case, we need to do weighted averaging 



of thermodynamic quantities, as described elsewhere |15|^4|| , to get converged results. This 
issue was not encountered for the sequences that were examined in this study. 

It is obvious that the thermodynamic quantities are independent of the underlying dy- 
namics provided the dynamics yields the Boltzmann distribution at t — > oo. Since the 
thermodynamics is only determined by the Boltzmann factor exp(— -Ep/fegT) it is conve- 
nient to determine them at low friction where the sampling of conformation space appears 



to be more efficient 15,35 



D. Database of sequences 

For our model one can, in principle, generate infinite number of sequences because of 
the continuous distribution of the effective hydrophobic interactions (see Eq. (|10|)). The 
vast majority of such sequences would be random, and hence would not fold to a unique 
native state on finite time scale. Our goal is to obtain a number of these sequences with the 
characteristic temperatures Tq and Tp such that they span a reasonable range of a (see Eq. 
(H)). It is clear that merely creating random sequences will not achieve this objective. In 
general, random sequences would take extraordinarily long times to fold. It is known that 
foldable sequences (those that reach the native state in finite times) are designed to have a 
relatively smooth energy landscape. Such sequences may, in fact, be minimally frustrated 



or have compatible long and short-range interactions p7| . Thus, in order to generate 
sequences that span the range of a and which are foldable, we used the most primitive 
design procedure in the inverse protein folding problem [E9,B0]. Because our objective is 
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not to provide the most optimal solution to the inverse folding problem the more reliable 
methods introduced recently were not utilized |Hfl. 

In all our studies the number of beads N = 22. The composition of all sequences is 
identical, namely, all of them contain 14 hydrophobic beads, 5 hydrophilic beads, and 3 
neutral beads. The sequences in this model differ from each other because of the precise 
way in which these beads are connected. In addition, due to the distribution of hydrophobic 
interactions not all hydrophobic beads are identical. The latter condition also introduces 
diversity among sequences. 

The method for creating the database of sequences is as follows. The first sequence A, 



B 9 N 3 (LB) 5 , examined has been already studied in our previous work [15||. This allows us 
to ascertain that our modified model (incorporating stiff harmonic bond-length potential 
instead of RATTLE algorithm) yields results consistent with our earlier studies. This se- 
quence has zero A, so all hydrophobic residues are identical. All other sequences (to be used 
as starting conditions for Monte Carlo optimization procedure, see below) were generated 
at random with different standard deviations A, but preserving the same composition, i.e. 
14 B beads, 5 L beads, and 3 N beads. By "random generation" of a sequence we mean 
that a sequence is randomly constructed from the beads of three types and the values of the 
parameter A specifying non-bonded interactions between hydrophobic residues in a sequence 
(Eq. @) were obtained using Gaussian distribution Eq. (|T0l) . Specifically, we used A = 
(one sequence), A = 0.1 (one sequence), A = 0.17 (one sequence), A = 0.3 (five sequences). 

The next step in creating the database of sequences is the choice of the target confor- 
mation, the precise choice of which is rather arbitrary because the only natural requirement 
is that it should be compact. However, due to intrinsic propensity toward bend formation 
near clusters of N residues, it seems reasonable to restrict the choice of target conformation 
topology to that with single U-turn. Obviously, the number of residues in a sequence N = 22 
allows us to define much more complicated topologies featuring multiple U-turns. In order 
to avoid the comparison of folding behavior of the sequences with the native conformations 
of different topologies the only criterion for selecting target conformations is that they must 
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have a single U-turn and be reasonably compact. The role of the topology of the ground 
state in determining the folding kinetics will be addressed in a future paper. Thus, using 
the conditions described above we have selected the target conformations from the database 
of low energy structures found in the course of slow cooling simulations. 

Once a target conformation is chosen Monte Carlo algorithm in sequence space PU|,|5D| is 



used to obtain an optimal sequence by means of the primitive inverse design procedure, i.e., 
the sequence that has the lowest energy and is compatible with the target conformation. 
There is no guarantee that this procedure is by any means the best way of designing optimal 
sequences as has been pointed out by Deutsch and Kurosky ||51|| . However, for our purposes 
this naive procedure suffices. The main idea is to perform small random permutations of a 
sequence while keeping its composition fixed and accepting (or rejecting) new sequences with 
respect to the Boltzmann factor P = exp(— AE/ksT), where AE is the energy variation 



due to permutation and T is the temperature of Monte Carlo optimization scheme |f49| , |50 
Hence, this algorithm is aimed at lowering the energy of the target conformation. The 
sequence providing the lowest energy at the target conformation is chosen as the desired 
sequence and used for further analysis. The control parameter which specifies the degree (or 
"quality") of optimization is the temperature T. For most sequences we have used low value 
of T = 0.2. We have made several attempts to run Monte Carlo algorithm in a sequence 
space by gradually decreasing the temperature from high values of T > 1.0 up to the very 
low values T < 0.01. This method, however, does not often provide lower energies of target 
conformations than that based on quenching the temperature at a certain value, and this is 
probably due to moderate ruggedness of the energy landscape in sequence space. It must be 
emphasized that the optimization scheme does not guarantee that the target conformation 
is actually the native state of the optimized sequence. This should be checked in the course 
of molecular dynamics simulations (see below). 

The nine sequences obtained using the procedure described above are listed in Table 1 
and are labeled A through I. Of the nine sequences eight (B-I) were generated using Monte 
Carlo method in sequence space. All sequences have different native conformations, except 
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the pairs of sequences A,D and B,C, which share the same native state. It must also be 
emphasized that although sequences F-H have identical distribution of beads they differ 
from each other with respect to the strength of hydrophobic interactions, i.e. they have 
different sets of prefactors A (Eq. fllCf)). 



E. Simulation temperature 

In order to compare the rates of folding for different sequences it is desirable to subject 
them to identical folding conditions. The equilibrium value of < x{T) > measures the 
extent to which the conformation at a given temperature T is similar to the native state. At 
sufficiently low temperature < x(T) > would approach zero, but the folding time may be 
far too long. We chose to run our folding simulations at a sequence dependent simulation 
temperature T s which is subject to two conditions: (a) T s be less than Tp for a specified 
sequence so that the native conformation has the highest occupation probability; (b) the 
value of < x(T — T s ) > be a constant for all sequences, i.e. 

< X (T = T s ) >= a. (21) 

In our simulations we choose a = 0.26 and for all the sequences studied T S /T F < 1. This 
general procedure for selecting the simulation temperatures has been already used in recent 
studies of folding kinetics using lattice models p3| , ^6| , ^7| . 



An alternative way of choosing T s is to assume that the probability of occupation of the 
native state be the same for all sequences. In our previous study [[43| on lattice models we 
have used this method for a small number of sequences. The trends in the folding times 
at the resulting simulation temperatures (as well as the kinetics) were very similar to those 
found when Eq. ([H]) is used to determine T s . 

It is also possible to keep the simulation temperature constant for all sequences. Because 
Tp and Tg vary greatly depending on sequence such a choice would not ensure that the 
probability of being in the native conformation is roughly the same for all sequences or 
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that all sequences are qualitatively similar to the same extent. In other words the folding 
conditions for the sequences would effectively be different if the temperature is held constant. 
This argument also implies that the statistical trends of folding kinetics with respect to 
intrinsic sequence dependent properties are expected to hold only over an optimal range of 
folding conditions which in simulations are entirely determined by temperature. 



F. Monitoring folding kinetics 

The simulation procedure for obtaining folding kinetics resembles the slow cooling 
method apart from the one principal difference that after heating the chain it is quenched 
to the temperature T s , defined from the condition < x(T = T s ) >= 0.26. The temperature 
is held constant after the quench to T = T s . The duration of the folding simulations de- 
pends on the rate of folding of a particular sequence and is typically on the order of 10 4 tl. 
For each sequence we generated between 100 — 300 independent trajectories. The folding 
kinetics is monitored using the fraction of trajectories P u (t) which does not reach the native 
conformation at time t 

P u (t) = 1 - f P fp {s)ds, (22) 
Jo 

where Pf p (s) is the distribution of first passage times, 

1 M 

P fv^) = j i Y.^-Tii)- (23) 

i=l 

In Eq. (|23|) Tu denotes the first passage time for the i th trajectory, i.e. the time when 
a sequence adopts native state for the first time. It is easy to show that the mean first 
passage time t mfpt to the native conformation (which is roughly the folding time, r F ) can 
be calculated as 

poo roc 
TMFPT = / tP fp {t)dt= / P u (t)dt. (24) 

Jo Jo 

The mean first passage time tmfpt can also be calculated using Tu for the M trajectories 
so that 
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^ M 

Tmfpt = jf^ni- (25) 

i=l 

We find that for all the sequences P u (t) can be adequately fit with several exponentials of 
the form 

P u (t) = $exp( —) + ]Ta fc exp(--Y (26) 

V t fast J I \ r k J 

where the sum is over the dominant misfolded structures, r k are the time scales for activated 

transition from one of the misfolded structures to the native state, and J2k = 1 — $. In 

this study we report r MFPT using Eq. fl2"4|) by fitting P u (t) according to Eq. (|26|). We have 

explicitly verified that Eq. fl24|), with P u (t) given by Eq. (|26|), and Eq. (|25|) yield practically 

identical results. 

It has been shown in a series of articles that multiexponential fit of the function P u (t) can 



be understood in terms of the kinetic partitioning mechanism [^[T5],|T6[[19| and is indicative of 
the distribution of time scales in the refolding of biomolecules. According to KPM a fraction 
of molecules $ folds to the native conformation very rapidly, while the remainder (1 — $) 
approaches the native state via a complex three stage multipathway mechanism (TSMM). 
Therefore, the time constants tfast and r k can be interpreted as the characteristic folding 
times of the fast and slow phases, respectively. When appropriate fit of the function P u (t) 
with exponentials is performed, the calculation of the mean first passage time tmfpt becomes 
straightforward. 

An alternative way to calculate folding times is based on the analysis of the time de- 
pendence of the overlap function. The overlap function \ is constantly calculated during 
simulations and its average time dependence is obtained as 

m »=l 

where Xiif) is the value of x f° r the i th trajectory at time t. We find that < x{t) > can a ls° 
be fit by a sum of exponentials (usually by one or two, see [24],p5||) 

(28) 









+ a 2 exp^— — ^ 
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Here t\ gives the estimate for the time scale of native conformation nucleation collapse 



process [I5 , |IE[] , while the largest time constant in the fit serves as an estimate of the 
folding time Tp. For most sequences biexponential fit provides the most accurate results. 
However, several sequences (typically ones with relatively small values of a) demonstrate 
a clear single exponential behavior of < x(t) >. In some cases there are additional slow 
components present on larger time scales as well. It has been shown in our previous papers 
that defining folding times using the functions P u {t) or < x(t) > yields qualitatively similar 
results fll5lj43fl . The same conclusion is valid for this model as well. Thus, tfast and the 
largest rfc in Eq. ( [2~6|) are roughly proportional to T\ and r%, respectively. 



IV. RESULTS 

A. Thermodynamic properties 

In this section we present the results on thermodynamics and kinetics of folding. Using 
the methodology described above, we studied 9 sequences, 8 of which were generated by 
performing Monte Carlo simulations in sequence space. The native conformation of each 
sequence was determined using the procedure described in Sec. (III.B). The example of 
the native conformation for the sequence G is given in Fig. (1). This picture demonstrates 
that all three neutral residues shown in grey are concentrated in the turn region. It is also 
clearly seen that hydrophobic residues shown in blue tend to be in close contact to each 
other due to their inherent attractive interaction, while hydrophilic residues shown in red 
point outwards. 

For each sequence we calculated the two characteristic equilibrium temperatures, collapse 
transition temperature Tq and folding transition temperature Tp from the temperature de- 
pendence of C v and Ax, respectively. The plots of < x > an d < E > were also obtained. 
Fig. (2) displays these functions for the sequence G. The plot of < x(T) > in Fig. (2a) 
indicates that at high temperatures 0.8 ^ T ^ 1.0 the value of < x >~ 0.8, so the chain has 
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negligible amount of the native structure. It was already shown in |35| that under these con 



ditions polypeptide chain is in a random coil conformation. The overlap function gradually 
decreases with the temperature and at T = 0.3 it reaches the value below 0.2. We do not 
plot < x > (T) for T < 0.3, because at such low temperatures it is difficult to obtain reliable 
thermodynamic averages due to non-ergodicity problems. Fortunately, this is not necessary 
for the aim of our study because all characteristic temperatures Tg, Tp and the simulation 
temperature T s are relatively high. The peak of the specific heat C v (see Fig. (2d)) which 
corresponds to the collapse transition temperature Tg is at 0.78. At this temperature protein 
undergoes a transition from an extended coil state to compact conformation. In fact, we 
calculated the radius of gyration < R 2 g > as a function of temperature for few sequences 
and found that at T « Tg it shows a sudden drop in accord with earlier and more recent 



studies |15| , |35| . However, at Tg the overlap function is still relatively large (< x >~ 0.7). 
The fluctuation of the overlap function Ax achieves a maximum at T = 0.62 and this is 
taken to be the folding temperature Tp. The value of Tp calculated from the midpoint of 
< x > (i- e -> when < x > is about 0.5) is also around 0.62. In general, we have found that T F 
obtained from the peak of Ax is slightly lower than that calculated from the temperature 



dependence of similar measures like < x > P^fl- It was demonstrated that this temperature 



corresponds to first order folding transition to the native conformation [|i~5"[p5l . By monitor- 



ing x{t) for several individual trajectories under equilibrium conditions at T ~ Tp we find 
that the protein fluctuates between the native and disordered conformations. All the nine 
sequences show similar behavior from which the various thermodynamic parameters can be 
easily extracted. The parameter a (see Eq. ([!])) for the nine sequences ranges from 0.14 to 
0.65. Thus, a meaningful correlation between the folding time and a, which is one of the 
major purposes of this study, can be established. 



The simulation temperature T s for the sequence G defined by Eq. (|2T| ) is found to be 
0.41. In Fig. (3) we present the dependence of T s on the parameter a. It is seen that 
the simulation temperature T s is a decreasing function of a. Thus, high values of T s are 
found for the sequences with small values of a, and this prompts us to anticipate that such 
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sequences are fast folders (see below). However, it was argued f|3| that this correlation must 
be viewed as statistical. This implies that if two sequences have close values of a, then a 
precise correlation with T s is not always expected. On the other hand, if a large number of 
sequences spanning a range of a is generated, then we expect a statistical correlation to hold. 
We also expect these conclusions to hold over a range of temperatures which is favorable 



for folding. The present off-lattice studies and those based on lattice models p4|j43[| confirm 
this expectation. 

B. Dependence of Tg and Tp on sequence 

One of the major results in this study (see Sec. (IV.D)) is that the folding times for 
all sequences correlate extremely well with a (cf. Eq. ([!])). Therefore, it is of interest 
to investigate how Tg and Tp vary with the sequence. It seems reasonable to assert that 
the folding temperature Tp depends rather sensitively on the precise sequence. In fact, it 
has been argued that to a reasonable approximation Tp is determined by the nature of the 



low energy spectrum (a sequence dependent property), at least in lattice models p|. The 



sensitive dependence of Tp on the sequence is explicitly confirmed in the present paper and 



in the previous lattice models as well p3| , |48| . In Table 1 we display Tp for the nine sequences. 
The values of Tp range from 0.20 to 0.62. Thus, the largest Tp is about three times larger 
than that of the smallest value. 

It might be tempting to think that Tg should be insensitive to the sequence and should 
essentially be determined by the composition of the sequence. This expectation arises es- 



pecially from heteropolymer theory [52]. According to the heteropolymer model, which 
essentially ignores short length scale details, Tg is determined by the average excluded vol- 
ume interactions, vq, and the average strength of hydrophobic interactions, Aoe^. Both these 
values are expected to be roughly constant, especially if the sequence composition is fixed. 
The determination of Tg for a polypeptide chain based on these arguments ignores surface 
terms and may, in fact, be valid in the thermodynamic limit, i.e. when the number of 
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beads tends to infinity. However, polypeptide chains are finite sized and hence the nature 
of surface residues which depends on the precise sequence are critical in the determination 
of Tg. This, in fact, is borne out in our simulations. In Table 1 we also display Tg for the 
nine sequences. Although the largest Tg is only approximately 1.5 times (as oppose to a 
factor of three for T F ) larger than that of the lowest it is clear that T e is very sequence 
dependent even though the composition for all sequence is identical . All the sequences have 
14 hydrophobic residues. Thus, both Tg and T F are determined not only by the intrinsic 
sequence but also by external conditions. In fact, Tg and T F , and, consequently, a can be 
manipulated by altering the external solvent conditions (pH, salt, etc.). It therefore follows 
that a single foldable sequence can have very different values of a depending on the solvent 
conditions and hence can exhibit very different kinetics. 

It is interesting to obtain estimates for Tg and T F using realistic values of e/i, the average 
strength of hydrophobic interaction. From Table 1 we note that the range of Tg is (0.58 — 
0.80)6^//^ with the lower values corresponding to sequences with larger a. The value of e h 
ranges from (1 — 2)kcal/mol. Assuming that e h 2kcal/mol the range of Tg is (48 — 67)°C. 
It appears that the better designed sequences (ones with smaller a values) have more realistic 
values of Tg. Similarly the range of T F for better designed sequences is (33 — 50)°C. These 
estimates suggest that optimized sequences can fold over a moderate range of temperatures 
rapidly and with relatively large yield. These expectations are explicitly demonstrated here. 

C. Kinetics of folding: The Kinetic Partitioning Mechanism (KPM) 

We studied the folding kinetics using the function P u (t), which gives the fraction of 
unfolded molecules (trajectories) at time t. We also computed the time dependence of 
< x(t) > to gain additional kinetic information concerning the approach to the native 
conformation. The function P u {t) has been obtained for each sequence at the simulation 
temperature T s from the analysis of large number of individual trajectories (M = 100 — 300) 
starting with different initial conditions. The resulting plots of P u (t) were fitted with a sum 
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of exponentials (one, two, or three) and the mean first passage time tmfpt (taken to be 
equal to t f ) for each sequence was calculated as described in Sec. (III.F). In general, it is 
found that after a short transient time P u (t) is extremely well fit by a sum of exponentials 
(cf. Eq. (|26|)). The partition factor, $, gives the fraction of molecules that reaches the native 
conformation on the time scale tfast by a NCNC mechanism and (^> tfast) being the 
time scales over which the remaining fraction 1 — $ reaches the native state |l5| , |l^ , ^] . 

Based on fairly general theoretical considerations it has been shown that o (= (Tg — 
Tp)/Tg) can be used to discriminate between fast and slow folding sequences f2~5|j2~2f . This 



has been confirmed numerically for lattice models We classify fast folding se- 

quences as those with relatively large values of $ (> 0.9). These sequences reach the native 
conformation without forming any discernible intermediates and essentially display a two 
state kinetic behavior. The plot of P u {t) for one of these sequence (sequence G) which can 
be fit with only one exponential in Eq. ( p6|) is presented in Fig. (4a). It is obvious that 
$ depends on the sequence (via a), the temperature, and other external conditions. Four 
sequences out of nine appear to be fast folders displaying a two state kinetic approach to 
the native conformation with $ > 0.9. These sequences have a values less than about 0.4. 

The other five sequences have $ values less than 0.9 and hence can be classified as 
moderate or slow folders. The values of a for these sequences exceed 0.4. The discrimination 
of sequences into slow and fast based on $ is arbitrary. An example of the kinetic behavior 
of a slow folder (sequence A) probed using P u (t) is shown in Fig. (4b). The generic behavior 
of P u (t) as a sum of several exponentials has been argued to be a consequence of the kinetic 
partitioning mechanism [|15|,[^]. Typically for slow folding sequences tfast varies from 
200Ti to 600ri, whereas the largest value of (as defined by Eq. (|26|) ) lies in the interval 
from 2500tx to 2.7 x lO 6 ^. Slow folding trajectories reach the native state via three stage 



multipathway mechanism [|15|,[^,[25[] , according to which random collapse of a protein (first 
stage) is followed by a slow search of the native state among compact conformations (second 
stage) that eventually leads the polypeptide chain to one of several misfolded structures. 
These misfolded structures have many characteristics of the native state. Generically the 
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rate determining step in the TSMM involves the transition (crossing a free energy barrier) 
from the misfolded structure to the native state (third stage) ||25|| . 

In order to obtain insights into the microscopic origins of the slow and fast phases we have 
analyzed the dynamic behavior of various trajectories. We have found that for sequences that 
find the native conformation essentially in a kinetically two state manner all the trajectories 
reach the native conformation without forming any discernible intermediates. Furthermore, 
for these cases once a certain number of contacts is established the native state is reached 
very rapidly, which is reminiscent of a nucleation process |6| Jl5|JT6[| . The time scale for 



such nucleation dominated processes is relatively short and it has been suggested that in 
these cases the collapse process and the acquisition of the native conformation occur almost 
simultaneously p2| |. It is for this reason we refer to this process as native conformation 



nucleation collapse (NCNC). This process has been referred to as nucleation condensation 
mechanism by Fersht ||10|| . These points are illustrated by examining the dynamics of the 
structural overlap function x(t) for fast folders. A typical plot for x(t) for a fast folder 
(sequence G) is shown in Fig. (5). In Fig. (5) (as well as in Figs. (6-8,15,16)) we plot x(t) 
for a trajectory labeled k averaged over a few integration steps h, i.e., 

XkW = - / Xk{s)ds. (29) 

T Jt-r/2 

The value of r = 5r^ which is much less than any relevant folding time scales. This figure 
(Fig. (5a)) shows that within 380r^ (the first passage time) the chain reaches the native 
conformation. After the chain reaches the native state there are fluctuations around the 
equilibrium value of < x > (= 0.26). Another example of folding trajectory for this sequence 
is presented in lower panel, which is further analyzed in Sec. (IV.D). 

The dynamical behavior shown in Fig. (5) for fast trajectories should be contrasted 
with the trajectories for other sequences that reach the native state by indirect off-pathway 
processes. An example of such a behavior for the moderate folder (sequence E, $ = 0.72) 
is shown in Fig. (6). The behavior presented in Fig. (6a) shows that after an initial rapid 
collapse (on the time scale of about 100 — 200r^) the chain explores intermediate state (where 
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x(t) is roughly constant for a large fraction ti/tu of time, where 77 is the life-time of the 
intermediate state) before reaching the native conformation at t Vi = 3026r L . Fig. (6b) shows 
another off-pathway trajectory for this sequence, in which native conformation is reached at 
1970r L . Although these slow trajectories are qualitatively similar, they clearly demonstrate 
that the chain samples different misfolded conformations depending on the initial conditions 
before it finally finds the native state. This fact further supports the multipathway character 
of the indirect folding process. After the native conformation is reached the overlap function 
fluctuates around the equilibrium value < x >— 0-26 or makes sudden jumps to the higher 
values of x ~ 0.4 and fluctuates around these values for a finite time. Such dynamics clearly 
reflects frequent visits to low lying structures (see Sec. (IV. D). The behavior shown in 
Fig. (6) is very typical of the trajectories that reach the native conformation via indirect 
mechanisms which are conveniently quantified in terms of TSMM. Fig. (7) presents a typical 
indirect trajectory for fast sequence I, which has the partition factor $ slightly less than 
unity. This trajectory reaches the native conformation at 238Ar L . 

It is also instructive to compare the dynamical behavior of the nucleation trajectories 
of fast and slow folding sequences. An example of a trajectory that reaches the native 
conformation via nucleation collapse mechanism for sequence E is shown in Fig. (8). It is 
important to note that the qualitative behavior of x{t) presented in Fig. (8) is very similar 
to that shown in Fig. (5). This further confirms that the underlying mechanism that leads 
the chain directly to the native conformation for sequences with large a is similar to the 
nucleation process. The only difference is that the partition factor $ is less for sequences 
with large a than for ones with small a. Fig. (8) also indicates that after reaching native 
state the chain makes frequent visits to neighboring misfolded conformations and, in some 
instances, gets trapped in these for relatively long times. 

The kinetic behavior described above suggests that the value of a can be used to classify 
sequences according to their ability to access the native state. It appears that not only 
does a correlate well with the intrinsic kinetic accessibility of the native conformation it also 
statistically determines the kinetic partition factor $. In Fig. (9a) we show the dependence 
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of $ on a for the nine sequences. The trend which emerges from this plot is that the 
sequences with larger values of a (and consequently with larger Tmfpt) have smaller values 
of $. For example, for the slow folding sequence labeled A with the largest value of a = 0.65 
the fraction of fast trajectories is $ = 0.43. In contrast, the fastest folding sequence labeled 
I (er = 0.14) for which biexponential fit of P u (t) is needed, has the value of $ > 0.9. 

D. Probes of kinetic and equilibrium intermediates using inherent structures: Roles 

of NBA and CBA 

The question of the nature and relevance of intermediates in protein folding is of abiding 
interest. Our studies here and elsewhere p2[^3j have demonstrated that the scenarios for 



folding can be conveniently classified in terms of a provided the foldable sequences are 
compared in a similar manner. In order to probe the role of intermediates in the approach 
to native state we have analyzed three sequences (E, G, and I) using the kinetic order 
parameter profiles. Sequences G and I are classified as fast folders (the partition factor <3> 
exceeds 0.9) while sequence E is a moderate folder with the associated a ($ = 0.72) lying 
in the boundary between fast and slow folding sequences. 

We analyze the role of kinetic and equilibrium intermediates (defined below) using the 
following methodology. Each trajectory is divided into a kinetic part and an equilibrium 
part. The kinetic part of a trajectory labeled i includes the portion from the beginning till the 
native state is reached for the first time, namely, the first passage time, Tu. The equilibrium 
part is taken to be the remaining portion of the trajectory from Tu till T max . For convenience 
we take T max to be the same for all trajectories. In order to characterize the nature of 
intermediates we use the overlap function, x, which, as described earlier, gives the degree of 
similarity to the native conformation. It is possible that the same value of % may correspond 
to different conformations and in some instances to conformations that are even structurally 
unrelated to each other. However, by studying the distribution of overlap function over a 
range of \ f° r several independent initial conditions and by directly comparing the resulting 
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conformations and calculating x between them we can ascertain the states that are visited 
with overwhelming probability before and after reaching the native conformation. In order to 
probe the nature of kinetic and equilibrium intermediates that the chain samples en route to 
the native conformation we have determined the "inherent" structures |jB| (see below). The 
inherent structures are obtained from the time course of examples of which are shown 
in Figs. (5-8). The basins of attractions are obtained before the chain reaches the native 
conformation for the first time (i.e., the "kinetic" basins) and are determined as follows. As 
the polypeptide chain approaches the native conformation (but has not yet reached it, i.e., 
t < Tu), we record several (usually about 10) instantaneous conformations which serve as 
initial conditions for steepest descent simulations. In this method the temperature is set to 
zero and the velocities of all residues are rescaled to zero after each integration step. This 
results in a " downhill" motion of a sequence on the energy surface. The final conformations 
of the steepest descent quench simulations (provided they are sufficiently long) are the 
conformations of local energy minima (inherent structures) which the sequence explores in 
the folding process. These conformations obtained at different times and with distinct initial 
conditions allow us to map the distribution of folding pathways. The same technique for 
getting inherent structures was used after the first passage time Tu < t < r max as well. 
These would give us the "equilibrium" intermediates. This analysis allows to compare the 
nature of intermediates in the kinetic pathways. 

For sequence G we determined the inherent structures using the instantaneous conforma- 
tions labeled (1-6) (all of which occur at t < Tu) shown in Fig. (5b). The inherent structures 
for this particular trajectory (and for others as well) almost always coincide with the native 
state. This clearly shows that for sequence G, for which the native state is reached by the 
nucleation collapse mechanism, the various inherent structures directly map into the native 
basin of attraction (NBA). The rapid approach to the NBA is the reason for the two state 
kinetics displayed. It also follows that the NBA is relatively smooth, i.e. the energy fluctua- 
tions characterizing the roughness is comparable to fceT,. The roughness associated with the 
NBA implies that the polypeptide chain spends a finite amount of time in close proximity 
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(x(t) ~< X >) to the native conformation prior to reaching it. It is worth emphasizing that 
this sequence (a = 0.20) fluctuates only around the native state even for t > t u for all the 
trajectories examined. 

Of the nine sequences we have examined I is the fastest folder, i.e., it has the smallest 
folding time. Nevertheless, the partition factor $ is slightly (but measurably) less than unity. 
The amplitude of the slow component is very small (for this sequence the biexponential fit to 
P u (t) suffices). These observations suggest that the underlying topography explored could 
be somewhat different from that of sequence G which is also a fast folder. Most of the 
trajectories reach the native state for sequence I rapidly without forming any intermediates 
and resemble the behavior shown in Fig. (5) for sequence G. However, there are "off- 
pathway" trajectories for this sequence an example of which is shown in Fig. (7). The 
inherent structures at the kinetic part for this particular trajectory (t < Tu) were determined 
using the conformations labeled (1-6) (Fig. (7)). In addition, the inherent structures were 
also calculated using the conformations (7-12) that the chain samples after the first passage 
time for this trajectory. We found that these inherent structures are all identical and differ 
very slightly (as measured by the overlap function). Consequently we characterize them as 
native-like intermediates. This sequence, although is a fast folder, has at least one competing 
basin of attraction (CBA) in which the structure is quite similar to the native state. Since 
there is a small fraction of molecules that reach the CBA prior to reaching the NBA the 
$ value is smaller than unity. The comparison between sequences G and I, both of which 
folds very rapidly, shows that there can be significant differences in the underlying energy 
surface. This is further illustrated in Sec. (IV. E). 

According to our classification sequence E is at least a moderate folder and exhibits the 
full range of the kinetic partitioning mechanism ($ = 0.72). There is a significant component 
of initial trajectories that reach the native state via three stage multipathway mechanism. 
Examples of these off-pathway trajectories are shown in Fig. (6). We have obtained the 
inherent structures using conformations labeled (1-6) in Fig. (6a) (that occur before tu) 
and using the conformations labeled (7-20) (that are sampled for times greater than tu). 
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It is found that these structures are nearly the same (excluding structure (6)) indicating 
that, in this instance, the polypeptide samples native-like intermediates en route the native 
conformation. In this sense the behavior for this trajectory is no different from that observed 
for off-pathway trajectories for sequence I (Fig. (7)). 

The result of a similar analysis using another trajectory shown in Fig. (6b) is dramatically 
different. The inherent structures obtained using the conformations labeled (1) and (2) and 
(3-7) are completely different from each other. Furthermore, the equilibrium intermediates 
identified with the inherent structures obtained using the instantaneous conformations (11- 
17) do not resemble those calculated during the kinetic portion (1-7). We do find that 
the equilibrium intermediates for this trajectory (11-17) are virtually identical to those 
calculated using the conformations (CBA's) sampled by other trajectories displayed in Figs. 
(6a, 8). Examination of other off-pathway trajectories reveal the presence of an exceptionally 
stable intermediate with x ~ 0.8. In fact, this intermediate survives for 90, OOOtl, while a 
typical first passage time is only about lOOOr^. Such intermediates described above are 
never visited again after folding is completed and hence they are kinetic intermediates. 

These observations imply that for moderate and slow folders there are several competing 
basins of attraction. Some of these serve as equilibrium intermediates, i.e., these have 
native-like characteristics and the chain revisits them even after reaching the native state. 
Others, which occur relatively early in the folding process, perhaps during the initial collapse 
process itself, are kinetic intermediates that are not visited after the native state is reached, 
at least during the time course of our simulations. Thus, for moderate and slow folders 
one has a distribution of CBA's. The presence of CBA's provide the entropic barriers to 



folding |46j resulting in slow approach to the native state. In contrast, for fast folders the 
only intermediates that are encountered, if at all, are all native-like. Thus, for fast folding 
sequences only the NBA dominates. In such cases the energy landscape can be thought of 
as being funnel-like P,|2T| . 
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E. Free energy profiles 



The analysis in the preceding subsection indicates that the free energy profile can be 
quite complex. The shapes of these profiles depend crucially on the sequence and external 
conditions (in our simulations that is specified only by the temperature). We have attempted 
a caricature of the free energy surface by computing the histogram of states expressed in 
terms of the potential energy E p and \. The histogram of states, which measures the 
probability of occurrence of the state with a given E p and Xi is defined as 

g(Ep, x) = tt E / S(E P - E P MKX - Xi(t))ds, (30) 

M j—l T max T\i Jtu 

where E p ^(t) and Xtif) are the values of potential energy and overlap function for the trajec- 
tory i at time t averaged over a small interval of 5tl- We have calculated g(E p , x) for three 
sequences at the sequence dependent simulation temperature T s . The values of M — 100, 
a grid size of 0.1 is used for E p and x is increased in intervals of 0.01. If T max » t u then 
Eq. (pDD gives the equilibrium distribution function. A free energy profile may be illustrated 
using the potential of mean force defined as 

W(E p ,x) = ~k B TJn[g(E p , X )}. (31) 

In Figs. (10-12) we plot g(E p ,x) for the three sequences. The bottom panel in each of 
these figures shows the contour plot of the histogram of states. For sequence G (Fig. (10)) 
it is clear that the NBA is the only dominant maximum and consequently the kinetics on 
this surface is expected to be two state-like. The plots in Fig. (10) for sequence G also show 
that after the NBA is located the chain only fluctuates in the NBA. The free energy profile 
for sequence I (as suggested by Fig. (11)) has in addition to the NBA at least one CBA. 
The presence of the CBA makes $ smaller than that for sequence G, for which $ = 1.0. 
Proteins with a larger a would have several CBA's. This is clearly indicated in Fig. (12) 
for sequence E which shows that there are two discernible CBA's which makes this model 
protein only a moderate folder. The profile of the potential of mean force for this sequence, 
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computed using Eq. (|3~l"D, is shown in Fig. (13). This figure shows that in general one has a 
complex structure for the free energy profile. It is also clear that this multivalley structure 
naturally leads to the KPM discussed in Sec. (IV. C). These figures also show that in special 
cases (small values of a) the folding kinetics can be described in terms of only the NBA or 
folding funnel 



F. Dependence of t f on a 

It is clear from the results discussed above that the parameter a (for a given external 
condition, which in our case is the simulation temperature) may be used to predict approx- 
imate kinetic behavior of various sequences. The folding time Tp , which is taken to be the 
mean first passage time r MFPT , is plotted as a function of a in Fig. (14a). This graph shows 
a remarkable correlation between these Tp and a. The sequences with small a < 0.4 fold to 
the native conformation very rapidly, so that Tp is less than about 600r^. However, Tp for 
the sequence with largest a = 0.65 is as large as 875258r^. Thus, variation of the parameter 
a from 0.14 to 0.65 results in three orders of magnitude increase in the folding time (from 
461t l to 875258r L ). It must be noted that the correlation between t f and a should be 
considered as statistical. One can easily notice few pairs of closely located data points in 
Fig. (14a), for which larger value of a does not correspond to larger Tp. Nevertheless, the 
general conclusion following from Fig. (14a) remains apparent: the parameter a allows us 
to predict the trend in the folding rate of the sequences by knowing only its thermodynamic 
properties, such as T$ and Tp. It should also be pointed out that because of the difficulty 
in computing the low energy spectra of the off-lattice models [|59,[B0"| correlations between 



folding times and other quantities (such as the energy gap or the relative value of the native 
energy compared to that of non-native conformations) were not tested. In addition, there 
appears to be no unambiguous way to determine the kinetic glass transition temperature, 
T g ,kin- Therefore, we have not tested the proposal that foldable sequences have large values 

T F /T gMn [g. 
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In order to study the dependence of the folding time on the parameter a we used the 
function P u {t) and defined folding time as the mean first passage time tmfpt (see Eq. fl24|)). 
It was already mentioned above that the alternative way is to analyze the overlap function 
< x{t) > an d take the largest exponent T2 in the exponential fit to < > as an estimate 
for the folding time. Due to computational limitations we did this only for five sequences 
and found the trend similar to that illustrated in Fig. (14a), i.e the folding time T2 correlates 
remarkably well with the parameter a. 

G. Kinetics and folding times at moderate friction 

The results presented above have been obtained with the value of friction coefficient fixed 
at (l — 0.05. In order to study the dependence of the folding kinetics on ( we have performed 
the same study of nine sequences at a larger value of the friction coefficient Cm = 5 = IOOCl- 
The plot showing the folding time t f = tmfpt as a function of the parameter a at Cm 
is displayed in Fig. (14b). In accord with the results obtained at the lower value of (l 
this figure also unambiguously demonstrates a good correlation between a and t>, so that 
the sequences with small values of a fold much faster than the sequences, having large a. 
Specifically, the sequence labeled I, which has the smallest value of a = 0.14, reaches the 
native conformation very rapidly within rp = 1554r^, while the sequence labeled A with 
a = 0.65 folds very slowly within Tp = 2.4 x lO 6 ^. As one may expect the overall folding 
times in the moderate friction limit are considerably larger than in the low friction limit. 
The folding times vary almost linearly with (. For most sequences the ratio tf{(m)/t~f((l) 
is 3 - 4. The largest value of this ratio is found for the slow folding sequence labeled D and 
is equal to 5. 

In order to compare the folding kinetics at Cm with those obtained at (l we analyzed 
several folding trajectories. Fig. (15) presents typical folding trajectory (in terms of the 
overlap function x{t)) f° r t ne sequence G, which displays two state kinetics and is classified 
as a fast folder. This figure shows that after few tertiary native contacts are established 
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the chain rapidly reaches the native state. In Fig. (16) we plot for typical slow (upper 
panel) and fast (lower panel) trajectories for the sequence E, which, in contrast to sequence 
G, exhibits KPM and is classified as a moderate folder. It is seen that the fast trajectory for 
the sequence E is very similar to a typical trajectory for the sequence G. The reason for this 
is that the underlying mechanism for the fast process, namely NCNC mechanism, is exactly 
identical. It is also very important to note that similar plots for these two sequence (Figs. 
(5-6,8)) obtained at (p are virtually the same as those shown in Figs. (15,16). This allows 
us to suggest that principal mechanisms of protein folding, such KPM, nucleation collapse, 
appear to be independent of the viscosity of surrounding medium. The time scales and the 



kinetic partition factor $, however, depend critically on viscosity p2[ . 

The classification of sequences into slow and fast folders based on the parameter a can 
also be carried out with the larger value of Cm- Fast folding sequences (4 out of 9) are 
characterized by the values of a ^ 0.4. The mean first passage time for fast folders tmfpt is 
below 3000r L . The function P u (t) for fast folders is adequately fit (apart from one sequence) 
with single exponential just as in the low friction limit. Thus folding of these sequences 
proceeds via nucleation collapse mechanism. The sequences with a ^ 0.4 can be classified 
as slow or moderate folders. These sequences have significantly larger mean first passage 
times tmfpt ranging from 3285tl to 2.4 x 10 6 T£. Most importantly, the fraction of unfolded 
molecules P u (t) is clearly two or three exponential (see Eq. (|26"D ) which is an apparent 
manifestation of KPM. As for the low friction limit the fraction of fast folding trajectories 
$ increases as the parameter a decreases (Fig. (9b)). Specifically, for the sequence A 
(a = 0.65) $ = 0.47, while for the fastest folding sequence I (a = 0.14) the fraction of fast 
trajectories becomes as large as 0.93. 

H. Quantitative dependence of tf on a 

It is interesting to comment on the quantitative dependence of Tp on a . Theoretical 
arguments suggest that, at least at small values of a, Tp should scale algebraically with a, 
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i.e. tf ~ u e with 6 = 3 [p2 ]. The present simulations as well as previous studies using lattice 



models suggest [ 24 , 48 1 that the data can also be fit with an exponential, i.e. 



T F ~r F(iV)exp(-) (32) 

where <7o depends on the value of friction and F(N) is a function that depends on N. It has 
been argued fH that F(N) ~ N u with 3.8 < u < 4.2 for a « and F(JV) ~ exp( v / iV) for 
larger a. The data in Fig. (14) can be fit with Eq. (^) with o"o ~ 0.06 at Cl and Cm- The 
fit of Tp to an algebraic power [jp ~ a e ) gives ~ 3.9 at (l and Cm- Further work will be 
needed to fully quantify the precise dependence of Tp on a. It appears that both Eq. fl32|) 



and the algebraic behavior [22] account adequately for the data given here and elsewhere 



for lattice models. The fit given in Eq. fl3~2|) appears to be a bit more accurate. 

V. IMPLICATIONS FOR EXPERIMENTS 

The results presented here together with the time scale estimates given in the Appendix 
have a number of implications for experiments. Here we restrict ourselves to providing some 
comparisons to the folding of chymotrypsin inhibitor 2 (CI2) which was probably the first 
protein for which a kinetic two state transition was established [^,^T|]. These experiments 
established that the kinetics for the fast phase, which corresponds to the molecules with 
proline residues in a trans conformation, follows a two state behavior. Furthermore the 
thermodynamics also displays a two state cooperative transition with the native conforma- 
tion being stable by about 7 kcal/mol at T = 25°G, pH = 6.3 and at zero denaturant 



concentration. Although not explicitly addressed here we have argued elsewhere |22[ that 
the marginal stability (relative to other structurally unrelated conformations) of the native 
state of proteins satisfies 



AG > k B TVN, (33) 

where the unknown prefactor is assumed to be of the order of unity. The CI2 examined by 
Jackson and Fersht has 83 residues and consequently Eq. (|33D gives AG ~ 5.5 kcal/mol at 
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T = 25°C. This is in fair agreement with the experimental determination. It appears that 
Eq. ( |33"D is consistent with the marginal stability of proteins of varying size. The bound 
given above seems to be a good estimate of the stability of biomolecules || . We expect the 
scaling relation of the type given in Eq. (^) to be accurate to only within a factor of two. 
Given that there is inherent experimental uncertainty in determining AG the agreement 
with the theoretical prediction within roughly 20 percent is remarkable. 

The kinetics of folding of CI2 can be rationalized using the ideas developed here. The time 
scale for native conformation nucleation collapse according to Eq. (|A5|) is t^cnc ~ 0.2 ms 
using the parameters specified in the Appendix and with a ~ 0.4 (we have taken Tg w Q0°C 
and Tp ~ 37°C). If we assume that the folding time changes exponentially with a (cf 
Eq. (|3~2"D) then the estimate for the nucleation collapse time changes to about 25 ms, 
where we have used o§ ~ 0.1. These estimates give an interval (a relatively broad one) 
0.2 ms ;$ tncnc ~ 25 ms. Despite the uncertainties in the theoretical estimates (unknown 
prefactors, errors in the estimates of 7, a etc.) the estimated values of t NC nc are within 
measured experimental values. The early experiments and more recent ones on CI2 and a 
mutant of CI2 indicate that the folding time for t^cnc is i n the range of (1.5 — 18) ms 



The fastest folding time of 1.5 ms is found for a mutant of CI2 |56|. Our theoretical 



estimates show that even if the external conditions are constant and the length of the 
polypeptide chain is fixed t^cnc can still be altered if a (see Eq. (|X5|)) is altered. Since 
a is very sensitive to sequence we suggest that the mutant of CI2 has a different value of 
a than the wild type. This can readily explain the decrease in folding time for the mutant 
under otherwise similar external conditions. Further work is needed to quantify these ideas. 

VI. CONCLUSIONS 

The folding of proteins is a complex kinetic process involving scenarios that are not 
ordinarily encountered in simple chemical reactions. This complexity arises due to the 
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presence of several energy scales and the polymeric nature of polypeptide chains. As a result, 
this complexity leads to a bewildering array of time scales that are only now beginning to 
be understood quantitatively in certain minimal models of proteins fl22| , |23f . Despite this 
remarkable complexity it has been known from the pioneering studies of Anfinsen that the 
specification of the primary sequence determines the three dimensional structure of proteins, 
i.e. native state topology is encoded in the primary sequence. The study presented here 



as well as our earlier work on lattice models p4|j43|| have shown clearly how the kinetic 
accessibility is also encoded in the primary sequence itself. Our results suggest that a wide 
array of mechanisms that are encountered in the folding process is, remarkably enough, 
determined by a simple parameter expressible in terms of the properties that are intrinsic 
to the sequence but affected by external conditions. It appears that the two characteristic 
equilibrium temperatures Tg and Tp determine the rate at which a given sequence reaches 
the native conformation. Tg and Tp not only depend on the sequence but also can be 
dramatically changed by varying the external conditions such as pH, temperature etc. Thus, 
the mechanism for reaching the native conformation for a single domain protein can change 
dramatically depending on the external conditions. This implies that a protein that exhibits 
two state kinetics under given external conditions does not necessarily follow the same 
kinetics, if the ambient conditions (e.g., pH) are altered. 

Our results show that generically the polypeptide chain reaches the native conformation 
by a kinetic partitioning mechanism (KPM). For a number of sequences studied here we 
have established that for given external conditions (for the computational studies it is the 
temperature only) a fraction of molecules $ reaches the native conformation directly via 
nucleation collapse mechanism, while the remainder follows a complex three stage multi- 
pathway kinetics. For both values of friction coefficient studied here this general scenario 
holds. 

It is clear from our results that once the external conditions are specified $ is essentially 
determined by the interplay of Tg and Tp as embodied in Eq. (|I]). The folding time correlates 
extremely well with the dimensionless parameter a = (Tg — Tp) /Tg independent of the value 
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of the external friction. The remarkable correlation between a and several kinetic properties 
lends credence to the notion that in small proteins at least a single collective coordinate 



description of folding may suffice ||54|| . It also follows from this study that only when a 
is small can folding be described in terms of NBA. For moderate and slow folders it is 
important to consider the interplay between NBA and CBA in determining folding kinetics. 
The independence of our general conclusions on the type of models (lattice versus off-lattice) 
|24] , |43H and on the details of the dynamics (Langevin dynamics or Monte Carlo) seems to 
indicate that the kinetic partitioning mechanism (along with a determining the trends in 
folding times) may describe in a concise fashion the scenarios by which single domain proteins 
reach the native conformation. 

There are quantitative differences between the results obtained for lattice and off-lattice 
models. For example using simulations of lattice models it was concluded that fast folders 
(with $ ps 1.0) have values of a less than about 0.15 |43|]. The off-lattice models suggest 
that fast folders can have a as large as about 0.4. Since the estimates of Tg and Tp using 
the off-lattice simulations appear to be in better accord with experiments it is tempting to 
suggest that for semi-quantitative comparison with experiments it is better to use off-lattice 
simulations. 
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APPENDIX A: 

In this appendix we map the natural time units to real times so that an assessment 
of the folding times for these minimal models as well as for small sized proteins can be 
made. In addition using a mapping between these models and proteins estimates for folding 
times for proteins with small number (< 70) of amino acids are also presented. We expect 
these estimates to be accurate to within an order of magnitude due to large uncertainties 
in the estimates of various quantities as well as a lack of theoretical understanding of the 
conjectures. From the equation of motion (see Eq. (pTTj) ) it is clear that when the inertial 
term dominates the natural unit of time is tl = (ma 2 /^) 1 ' 2 . Typical values of mo and ao for 
amino acid residues are 3 x 10~ 22 g and 5 x 10 _8 cm, respectively. These are the masses and 
the Van der Waals radius of the amino acid residues. The hydrophobic interaction energy 
€h is of the order of lkcal/mol or 7 x 10~ 14 erg. If these values are changed by factor of two 
or so there will be not a significant change in our conclusions. Assuming that a bead in our 
model roughly represents one amino acid we evaluate tl as 




The value of the low friction coefficient used in our simulations (l = 0.05i7i/tl = 5x 10~ 12 g/s 
while the value of (m = IOOCl = 5 x lCr 10 g/s. It is interesting to compare these values 
for ( to that obtained in water which has at room temperature T = 25°C a viscosity of 
O.OlPoise with lPoise being equal to 1 g/(s ■ cm). The friction on a bead of length a may 
be estimated as 

(water ^ ^T] water aQ « 9 X 10~ 9 #/s. (A2) 

From this we get (l/ (water ~ 10~ 3 , while (m/ (water ~ 0.1. The low friction would correspond 
to the energy diffusion regime in the Kramer's description of the unfolding to folding reaction. 
In the moderate friction there could be a competition between inertial and viscous damping 
terms leading perhaps to the Kramer's turnover regime familiar in the literature on simple 
reactions. 
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In the overdamped limit the inertial term can be ignored and the natural measure for 
time is 

(a 2 Qrnja 3 e h 
t h ~ — — ~ — — = «r L — — , A3 
k B T s k B T s k B T s 

where a is a constant. In our simulations a = 0.05 for ( L and a = 5.0 for ( M . The typical 
value of eh/k B T s is about 2, where once again we have used = Ikcal/mole and taken T s 
to be the room temperature. For water at room temperature r B ph 3ns with a ~ 100. If we 
assume that the higher value of friction used in this study is in the slightly overdamped limit 
then the approximate time unit becomes tm ~ 0.3ns. Since the higher value of friction is 
more realistic we can estimate the folding times for small proteins using the computed time 
scale. For a = 5.0 our simulation results give the folding time ranging from 100 tm to 10 6 tm- 
The folding time for the case of higher friction (with a exceeding 5) also ranges from 10 3 r^ 
(for the smallest a) to 10 6 tm (for the largest a) (Klimov and Thirumalai, unpublished). A 
naive estimate using these results would suggest that the folding time for these sequences 
ranges from 10~ 6 s to 10 _4 s. 

A better estimate of these times can perhaps be made by recognizing that each bead in 
the minimal model corresponds to a blob containing g number of amino acids [^j. If the 
structure within a blob is represented by roughly spherical size a than we can use a ~ g u a>o, 
where g v is the "swelling factor" mapping the minimal model to real proteins. Then the 
natural time unit for the motion of such a blob in the overdamped limit becomes 

fa^o| =j3 „ TH (A4) 

k B T 

The range of v is | — 1, with v — | corresponding to globular structure within a blob and 
v — 1 corresponding to maximum repulsion among the residues in a blob. This should be 
viewed as a guess and is not expected to be correct given that g is small. Realistic values of g 
are expected to be between 2 and 3 making the 22-mer minimal model to (perhaps) 

correspond with 44-66 amino acid residue proteins. For g = 2, ranges from 0.4ns to 
1.6ns, while for g = 3 t§ ranges from 0.4ns to 5.4ns. Assuming that g = 3 and v — 1 
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(which would give the largest time scales ) the folding estimates for small proteins (number 
of amino acids smaller than 70) range from 8 x 10~ 6 s (for small a) to 10ms (for large a). 

This exercise suggests that no matter how the mapping is done the most relevant time 
scale for folding kinetics of small proteins under normal folding conditions (around room 
temperature and low denaturant concentration) is between microseconds to milliseconds. 
In particular, for those proteins that reach the native conformation predominantly via the 
nucleation collapse process (characterized by relatively small a) the time scale for folding is 



between microseconds to milliseconds for small proteins. One of us has argued |22j that the 
time scale for the NCNC process is given by 

tncnc ^ ^* 3 N» (A5) 
7 

where u is in the range of 3.8 to 4.2. There is usually a large uncertainty in the surface tension 
7 between the hydrophobic residues and water. The range for 7 is between 25 — 75 cal/(A 2 ■ 
mol). The largest time scale to Eq. (^3) emerges when w ~ 4.2 and 7 ~ 25 ca//(A 2 ■ mol). 
Using 77 ~ 0.01 Poise, a ~ 0.2, and a ~ 5 x 10~ 8 cm t^cnc ranges from 10 _6 s to 0.1ms 
as N varies from 22 to 66. The values based on theoretical arguments (cf. Eq. (K5)) are 
consistent with the numerical estimates based on the simulations. 

It is interesting to compare the estimates for the fast process, corresponding to the 
NCNC mechanism, obtained using Eqs. (|A5| , [A4|) and simulation results with experimental 
results. All the theoretical estimates yield t^cnc ~ 0.1 ms. The recent experiments on 
chymotrypsin inhibitor 2 suggest that the time scale for the nucleation collapse process is in 
the range of 1.5 — 15 ms |56| depending on external conditions. The experimental times are 
not inconsistent with our simulation results and theoretical estimates given the uncertainty 
in the values of the various parameters. Our studies further underscore the importance of 
processes relevant for folding of proteins in submillisecond time scale especially for the NCNC 
process. Further experiments on these time scales are needed for an explicit experimental 
demonstration of the native conformation nucleation collapse mechanism [pl|-p5[|. 
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FIGURE CAPTIONS 

Fig. 1. The /3-type native structure of the sequence LB 9 (N L) 2 N B LB 3 LB labeled G. 
In the turn region chain backbone adopts gac/ie-conformations. The hydrophobic beads are 
given by a blue color, the hydrophilic beads are represented by red, and the neutral beads 
are shown in grey. 

Fig. 2. The temperature dependence of the thermodynamic quantities for sequence G 
calculated using slow-cooling method: (a) overlap function < xCO >> C°) fluctuations of 
the overlap function A%(T); (c) energy < E(T) >; (d) specific heat C V (T). The peaks in 
the graphs of Ax{T) and C V (T) correspond to the folding transition and collapse transition 
temperatures, T F and Tq. 

Fig. 3. The dependence of the simulation temperature T s , as defined by Eq. (0), on 
the parameter a = {T e — T F )/T e . 

Fig. 4. The fraction of the unfolded molecules P u {t) as a function of time for the 
sequences G (a) and A (b). Time is measured in the units of tl (cf. Eq. (|Al|)). The 



solid line in Fig. (4a) is a single exponential fit to the data. This implies that for this 
sequence folding is kinetically a two state process ($ = 1.0). The solid line in Fig. (4b) is a 
three exponential fit to the data. The multiexponential process is indicative of the kinetic 
partitioning mechanism with $ = 0.43 (see Eq. (pB"l)). 

Fig. 5. Dynamics of a typical fast-folding trajectory as measured by x(t) (sequence G) 
at (l. It is seen in Fig. (5a) that on a very short time 380r^ the native conformation is 
reached. After the native conformation is reached x(t) fluctuates around the equilibrium 
value < x >■ (b) Another trajectory for this sequence, for which inherent structures at the 
times (1-6) were determined. Horizontal arrow in this plot indicates the region of native basin 
of attraction (NBA). It is seen that the chain approaches NBA but spends finite amount of 
time there before reaching the native state at 1525r^. Dashed line indicates < x > = 0-26 
at T s . Vertical arrows indicates the first passage time. 

Fig. 6. Examples of two trajectories that reach the native conformation by an indirect off- 
pathway process. These trajectories are for sequence E at (l- The kinetics exhibited by the 
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off-pathway process suggests that the native state is reached by a three state multipathway 
mechanism. Fig. (6a) shows that after initial rapid collapse on the time scale of (100— 200)t£ 
the chain gets trapped in misfolded compact structure (indicated by nearly constant value 
of x f° r l° n g times). In this case the native state is eventually reached at ss 3026r L . Fig. 
(6b) which is given for a different slow trajectory shows that the chain samples at least 
two distinct misfolded structures before the first passage time is attained at ~ 19707£. In 
both figures numbers indicate the points where inherent structures were determined. The 
time course of x(t) reveals that the chain samples a number of kinetic and equilibrium 
intermediates. It was found that inherent structures (1-5,7-20) (Fig. (6a)), (11-17) (Fig. 
(6b)), and (1-5) (Fig. (8)) are almost identical and are accessible before and after first 
passage time. For this they are classified as native-like equilibrium intermediates. However, 
the inherent structures (1-2) and (3-7) (Fig. (6b)) are examples of kinetic intermediates. 
Dashed lines in these plots indicate the equilibrium value of < x >— 0-26 at T s . Horizontal 
arrows indicate the regions of CBA's. Vertical arrows indicate the first passage time. 

Fig. 7. Dynamics of one of the few off-pathway trajectories for sequence I (<3> = 0.95). 
Inherent structures are determined at the points marked by the numbers. Analysis shows 
that structures (1-6) and (7-12) are identical that allows us to refer to them as equilibrium 
native-like intermediates. Note that the vast majority (~ 0.95) of trajectories fold via NCNC 
mechanism. The native state is reached at 2384t l . Horizontal arrows indicate the regions 
of CBA's. Vertical arrow indicates the first passage time. 

Fig. 8. This figure shows the dynamics of x{t) for one trajectory for sequence E (with $ = 
0.72) that reaches the native conformation rapidly. In this example the native conformation 
is attained at 325tl. Comparison of this figure with Fig. (5) (for sequence G with $ > 0.9) 
shows that the dynamics is very similar. This implies that the underlying mechanism (NCNC 
mechanism) of fast folding trajectories of sequences with large a (or equivalently small <£>) is 
similar to that by which the molecules reach the native state in kinetically two state manner. 
Horizontal arrow indicates the regions of CBA's. Vertical arrow indicates the first passage 
time. 
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Fig. 9. Correlation between the fraction of fast folding trajectories $ and the parameter 
a = (T g — T F )/T e . Most sequences with small a have $ 1.0. Vertical dashed line shows 
classification of sequences with respect to $. (a) is for low friction value, (b) corresponds 
to moderate friction. These sequences with $ > 0.9 are classified as fast folders. The 
classification of sequences into slow category is somewhat arbitrary. The classification does 
not seem to depend on the value of (. 

Fig. 10. (a) Histogram of states g as a function of two variables, E p and x, is given for 
sequence G. (b) The contour plot of the histogram of states g for this sequence. Lighter 
areas correspond to peaks of g. Single peak of the histogram of states suggests that at 
equilibrium the chain is completely confined to a native basin of attraction. 

Fig. 11. Histogram of states g and contour plot of g for sequence I. These plots reveal two 
peaks of the histogram of states that manifests the presence of competing basin of attraction 
which makes the partition factor $ less than unity. 

Fig. 12. Histogram of states g and contour plot of g for sequence E. One can clearly see at 
least three maximums of g. These plots illustrates the existence of several competing basins 
of attractions (intermediates) that gives rise to complex folding kinetics which features a 
combination of three stage multipathway and nucleation collapse mechanisms. 

Fig. 13. The profile of the potential of the mean force W in terms of two variables, 
E p and for sequence E. This further illustrates that the free energy landscape of this 
sequence features multiple funnels (basins of attractions). The plane at W = 3.8 is given 
for eye reference. 

Fig. 14. Dependence of the folding time r F on the parameter a = (T e — T F )/T e . It is 
seen that t f correlate remarkably well with a, so that sequences with small value of a reach 
native state very rapidly, whereas those characterized by large a fold slowly. Solid lines 
indicate exponential fit the the data. The actual fit to the data is discussed in the text. Fig. 
(14a) is for the low friction value and Fig. (14b) shows data for the moderate friction limit. 

Fig. 15. An example of a fast folding trajectory that reaches the native conformation 
by a NCNC process. This is for sequence G at (m- This figure shows that the dynamics of 
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the fast folding trajectory is qualitatively similar to that obtained at ( L (see Fig. (5)). In 
both cases the native conformation is reached rapidly following the formation of a critical 
number of contacts (nucleus) and collapse. The first passage time for this trajectory is 603tl. 
Dashed line gives the equilibrium value of < x >— 0-26 at T s . Vertical arrow indicates the 
first passage time. 

Fig. 16. (a) An example of a slow folding trajectory as recorded by x(t) (sequence E) 
at (m- After initial rapid collapse on the time scale < IOOOtl the chain samples various 
compact conformations and finally reaches the native state at 12919tl as indicated by an 
arrow. This trajectory shows that at least four distinct kinetic structures are sampled as 
the chain navigates to the native conformation. Dashed line in both panels indicates the 
equilibrium value of < x >— 0-26 at T s . (b) Typical fast folding trajectory for this sequence. 
This trajectory is similar to those characteristic of fast folders (see Fig. (15)). The native 
conformation is found very rapidly at 563tx as indicated by an arrow. The results displayed 
in Figs. (5-8) and Figs. (15,16) show that the qualitative aspect of the kinetic partitioning 
mechanism is not dependent on the friction coefficient. Vertical arrow indicates the first 
passage time. 
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TABLES 



TABLE I. The list of sequences and their parameters studied in simulations 



Sequence label 


sequence 


A 


T F 


rri 

Te 


a 


A 


B 9 N 3 (LB) 5 





0.20 


0.58 


0.65 


B 


Bs{NL) 2 NBLB 3 (LB)2 





0.30 


0.62 


0.51 


C 


Bgiy LJ\2LBLB 3 {LB)2 


0.3 


0.38 


0.72 


0.47 


D 


B 9 N 2 LNB(LB) 4 


0.17 


0.36 


0.62 


0.41 


E 


LB 7 NBNLN(BL) 2 B 3 LB 


0.1 


0.40 


0.66 


0.39 


F 


LB 9 (NL) 2 NBLB 3 LB 


0.3 


0.46 


0.76 


0.39 


G 


LB 9 (NL) 2 NBLB 3 LB 


0.3 


0.62 


0.78 


0.20 


H 


LB 9 (NL) 2 NBLB 3 LB 


0.3 


0.59 


0.80 


0.26 


I 


LBN B 3 LB 3 N 2 B 2 LBLB 3 LB 


0.3 


0.54 


0.62 


0.14 
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